---
title: Language Modeling Data Formats
permalink: /docs/lm-data-formats/
excerpt: "Data Formats for Language Modeling tasks."
last_modified_at: 2020-05-02 17:58:53
toc: true
---

For Language Modeling tasks, the input data should be in a text file with one text sample per row. The format is the same whether you are fine-tuning a language model or training one from scratch.

## Train Data Format

*Used with [`train_model()`](/docs/lm-model/#training-a-languagemodelingmodel)*

The train data should be in a text file with one text sample per row.

```text
The CoRg system is a system to solve commonsense reasoning problems. The core of the CoRg system is the automated theorem prover Hyper that is fed with large amounts of background knowledge. This background knowledge plays a crucial role in solving commonsense reasoning problems. In this paper we present different ways to use knowledge graphs as background knowledge and discuss challenges that arise. 
Despite the recent successes of deep learning, such models are still far from some human abilities like learning from few examples, reasoning and explaining decisions. In this paper, we focus on organ annotation in medical images and we introduce a reasoning framework that is based on learning fuzzy relations on a small dataset for generating explanations. Given a catalogue of relations, it efficiently induces the most relevant relations and combines them for building constraints in order to both solve the organ annotation task and generate explanations. We test our approach on a publicly available dataset of medical images where several organs are already segmented. A demonstration of our model is proposed with an example of explained annotations. It was trained on a small training set containing as few as a couple of examples. 
We consider intuitionistic variants of linear temporal logic with `next', `until' and `release' based on expanding posets: partial orders equipped with an order-preserving transition function. This class of structures gives rise to a logic which we denote $\iltl$, and by imposing additional constraints we obtain the logics $\itlb$ of persistent posets and $\itlht$ of here-and-there temporal logic, both of which have been considered in the literature. We prove that $\iltl$ has the effective finite model property and hence is decidable, while $\itlb$ does not have the finite model property. We also introduce notions of bounded bisimulations for these logics and use them to show that the `until' and `release' operators are not definable in terms of each other, even over the class of persistent posets. 
Understanding stories is a challenging reading comprehension problem for machines as it requires reading a large volume of text and following long-range dependencies. In this paper, we introduce the Shmoop Corpus: a dataset of 231 stories that are paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where the summary is chronologically aligned with respect to the story chapter. From the corpus, we construct a set of common NLP tasks, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories. We then show that the chronological alignment provides a strong supervisory signal that learning-based methods can exploit leading to significant improvements on these tasks. We believe that the unique structure of this corpus provides an important foothold towards making machine story comprehension more approachable. 
We have proposed going beyond traditional ontologies to use rich semantics implemented in programming languages for modeling. In this paper, we discuss the application of executable semantic models to two examples, first a structured definition of a waterfall and second the cardiopulmonary system. We examine the components of these models and the way those components interact. Ultimately, such models should provide the basis for direct representation. 

```
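
As a minimal sketch of producing a file in this format, the snippet below writes a list of text samples to disk with one sample per row. The helper name `write_samples` and the whitespace-collapsing step are assumptions for illustration and are not part of Simple Transformers. A temporary directory is used so the example does not touch your working directory.

```python
import tempfile
from pathlib import Path


def write_samples(samples, path):
    """Write text samples to `path`, one sample per row (hypothetical helper)."""
    rows = []
    for sample in samples:
        # Collapse all internal whitespace, including newlines, so that each
        # sample occupies exactly one row in the output file.
        row = " ".join(sample.split())
        if row:  # skip samples that are empty after cleaning
            rows.append(row)
    Path(path).write_text("".join(row + "\n" for row in rows), encoding="utf-8")
    return len(rows)


samples = [
    "First abstract.\nIt spans two lines in the raw source.",
    "   ",  # whitespace only, so it is dropped
    "Second abstract on a single line.",
]

train_path = Path(tempfile.mkdtemp()) / "train.txt"
written = write_samples(samples, train_path)
```

Collapsing newlines matters because a line break inside a sample would otherwise split it into two rows, and therefore two samples.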


## Evaluation Data Format

*Used with [`eval_model()`](/docs/lm-model/#evaluating-a-languagemodelingmodel)*

The evaluation data should be in a text file with one text sample per row.

```text
There has been a growing concern about the fairness of decision-making systems based on machine learning. The shortage of labeled data has been always a challenging problem facing machine learning based systems. In such scenarios, semi-supervised learning has shown to be an effective way of exploiting unlabeled data to improve upon the performance of model. Notably, unlabeled data do not contain label information which itself can be a significant source of bias in training machine learning systems. This inspired us to tackle the challenge of fairness by formulating the problem in a semi-supervised framework. In this paper, we propose a semi-supervised algorithm using neural networks benefiting from unlabeled data to not just improve the performance but also improve the fairness of the decision-making process. The proposed model, called SSFair, exploits the information in the unlabeled data to mitigate the bias in the training data. 
There is a growing interest and literature on intrinsic motivations and open-ended learning in both cognitive robotics and machine learning on one side, and in psychology and neuroscience on the other. This paper aims to review some relevant contributions from the two literature threads and to draw links between them. To this purpose, the paper starts by defining intrinsic motivations and by presenting a computationally-driven theoretical taxonomy of their different types. Then it presents relevant contributions from the psychological and neuroscientific literature related to intrinsic motivations, interpreting them based on the grid, and elucidates the mechanisms and functions they play in animals and humans. Endowed with such concepts and their biological underpinnings, the paper next presents a selection of models from cognitive robotics and machine learning that computationally operationalise the concepts of intrinsic motivations and links them to biology concepts. The contribution finally presents some of the open challenges of the field from both the psychological/neuroscientific and computational perspectives. 
Recent success of pre-trained language models (LMs) has spurred widespread interest in the language capabilities that they possess. However, efforts to understand whether LM representations are useful for symbolic reasoning tasks have been limited and scattered. In this work, we propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data. To address this, we propose an evaluation protocol that includes both zero-shot evaluation (no fine-tuning), as well as comparing the learning curve of a fine-tuned LM to the learning curve of multiple controls, which paints a rich picture of the LM capabilities. Our main findings are that: (a) different LMs exhibit qualitatively different reasoning abilities, e.g., RoBERTa succeeds in reasoning tasks where BERT fails completely; (b) LMs do not reason in an abstract manner and are context-dependent, e.g., while RoBERTa can compare ages, it can do so only when the ages are in the typical range of human ages; (c) On half of our reasoning tasks all models fail completely. Our findings and infrastructure can help future work on designing new datasets, models and objective functions for pre-training. 
Synthesizing a program that realizes a logical specification is a classical problem in computer science. We examine a particular type of program synthesis, where the objective is to synthesize a strategy that reacts to a potentially adversarial environment while ensuring that all executions satisfy a Linear Temporal Logic (LTL) specification. Unfortunately, exact methods to solve so-called LTL synthesis via logical inference do not scale. In this work, we cast LTL synthesis as an optimization problem. We employ a neural network to learn a Q-function that is then used to guide search, and to construct programs that are subsequently verified for correctness. Our method is unique in combining search with deep learning to realize LTL synthesis. In our experiments the learned Q-function provides effective guidance for synthesis problems with relatively small specifications. 
Many theories, based on neuroscientific and psychological empirical evidence and on computational concepts, have been elaborated to explain the emergence of consciousness in the central nervous system. These theories propose key fundamental mechanisms to explain consciousness, but they only partially connect such mechanisms to the possible functional and adaptive role of consciousness. Recently, some cognitive and neuroscientific models try to solve this gap by linking consciousness to various aspects of goal-directed behaviour, the pivotal cognitive process that allows mammals to flexibly act in challenging environments. Here we propose the Representation Internal-Manipulation (RIM) theory of consciousness, a theory that links the main elements of consciousness theories to components and functions of goal-directed behaviour, ascribing a central role for consciousness to the goal-directed manipulation of internal representations. This manipulation relies on four specific computational operations to perform the flexible internal adaptation of all key elements of goal-directed computation, from the representations of objects to those of goals, actions, and plans. Finally, we propose the concept of `manipulation agency' relating the sense of agency to the internal manipulation of representations. This allows us to propose that the subjective experience of consciousness is associated to the human capacity to generate and control a simulated internal reality that is vividly perceived and felt through the same perceptual and emotional mechanisms used to tackle the external world. 

```
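
Because the train and evaluation files share the same format, one way to obtain both is to split a single one-sample-per-row corpus. The following is a sketch under that assumption. The helper `split_corpus`, the 20% evaluation fraction, and the file names are hypothetical choices, not something Simple Transformers provides or requires.

```python
import random
import tempfile
from pathlib import Path


def split_corpus(corpus_path, train_path, eval_path, eval_fraction=0.1, seed=42):
    """Split a one-sample-per-row file into train and eval files (hypothetical helper)."""
    lines = Path(corpus_path).read_text(encoding="utf-8").splitlines()
    # Drop blank rows so they are not treated as empty samples.
    samples = [line.strip() for line in lines if line.strip()]
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(samples)
    # Always keep at least one sample for evaluation.
    n_eval = max(1, int(len(samples) * eval_fraction))
    eval_samples, train_samples = samples[:n_eval], samples[n_eval:]
    for path, rows in ((train_path, train_samples), (eval_path, eval_samples)):
        Path(path).write_text("".join(row + "\n" for row in rows), encoding="utf-8")
    return len(train_samples), len(eval_samples)


work_dir = Path(tempfile.mkdtemp())
corpus_path = work_dir / "corpus.txt"
corpus_rows = [f"Sample text number {i}." for i in range(10)]
corpus_path.write_text("\n".join(corpus_rows) + "\n", encoding="utf-8")

train_path = work_dir / "train.txt"
eval_path = work_dir / "eval.txt"
n_train, n_eval = split_corpus(corpus_path, train_path, eval_path, eval_fraction=0.2)
```

The resulting `train.txt` and `eval.txt` can then be passed to `train_model()` and `eval_model()` respectively.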

## `train_files` Format When Training a Language Model From Scratch

The `train_files` argument is required when creating a `LanguageModelingModel` for training a language model from scratch. It can be either a path to a single text file in the same format as the train data, or a list of paths to such files.
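
The two accepted forms of `train_files` can be illustrated with a short sketch. The helper `normalize_train_files` is hypothetical and only checks that the given paths exist. The `LanguageModelingModel` call is shown as a comment and is an assumed usage (model type `"roberta"`, model name `None` for training from scratch) that is not executed here.

```python
import tempfile
from pathlib import Path


def normalize_train_files(train_files):
    """Return `train_files` as a list of existing file paths (hypothetical helper)."""
    # Accept either a single path or an iterable of paths.
    if isinstance(train_files, (str, Path)):
        paths = [train_files]
    else:
        paths = list(train_files)
    missing = [str(p) for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing train files: {missing}")
    return [str(p) for p in paths]


# Create two small files in the one-sample-per-row format for the demo.
work_dir = Path(tempfile.mkdtemp())
file_a = work_dir / "train_a.txt"
file_b = work_dir / "train_b.txt"
file_a.write_text("First sample.\nSecond sample.\n", encoding="utf-8")
file_b.write_text("Third sample.\n", encoding="utf-8")

single = normalize_train_files(file_a)          # a single path
many = normalize_train_files([file_a, file_b])  # a list of paths

# Assumed usage with Simple Transformers (not executed in this sketch):
# from simpletransformers.language_modeling import LanguageModelingModel
# model = LanguageModelingModel("roberta", None, train_files=many)
```

Checking the paths up front gives a clear error before any model setup starts.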